A11.TEX[162,PHY] - www.SailDart.org

perm filename A11.TEX[162,PHY] blob sn#853007 filedate 1988-02-04 generic text, type C, neo UTF8
COMMENT ⊗   VALID 00003 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00002 00002	%added to a12
C00015 00003	\proclaim Lemma. For $x>0${}
C00020 ENDMK
C⊗;
%added to a12
%see next page (tossed on 9-30-86)
\magnification\magstephalf
\input macro.tex
\def\today{\ifcase\month\or
  January\or February\or March\or April\or May\or June\or
  July\or August\or September\or October\or November\or December\fi
  \space\number\day, \number\year}
\baselineskip 14pt
\rm
\line{\sevenrm a11.tex[162,phy] \today\hfill}

\parskip 5pt
\bigskip
\line{\bf Mean and Variance of Cookie Monster Tail Size\hfill}

In coalesced hashing with table size~$h$, containing $t$~keys, let $c$
be a pointer to the first word of the cellar region. Initially,
$c=h+1$. There are $h+1-c$ cells in the cellar. The other $c-1$
cells include $h-t$ empty cells, randomly distributed. When another key
is inserted, there is probability $t/h$ of a collision. When a collision
occurs, $c$~is moved left to an empty cell, where the new key is inserted.
Let $i$ be the amount by which $c$ moves.

In what follows, we analyze the mean and variance of~$c$ after $t$~insertions.
Much of the analysis is only needed for the variance.

\proclaim Table Search Lemma. A table has size $s$, and contains $u$~unused
cells in random positions. Define $i$ as the index of the first empty cell.
That value has probability $\left.{s-i\choose u-1}\right/{s\choose u}$,
so 
$$E(i↑{\bar{p}})=p!\sum↓i{{i-p-1\choose p}{s-i\choose u-1}\over
{s\choose u}}=p!\,{{s+p\choose u+p}\over{s\choose u}}={p!(s+1)↑{\bar{p}}\over
(u+1)↑{\bar{p}}}\,.$$
In coalesced hashing, $s=c-1$, $u=h-t$, so
$$\eqalign{E(i\mid c)&={c\over h-t+1}\cr
\noalign{\smallskip}
E(i↑{\bar{2}}\mid c)&={2c↑{\bar{2}}\over (h-t+1)↑{\bar{2}}}\cr
\noalign{\smallskip}
E(i↑2\mid c)&={c(2c-h+t)\over (h-t+1)↑{\bar{2}}}\,.\cr}$$
$$\eqalign{E(i\mid c)&={c\over h-t+1}\,;\cr
\noalign{\smallskip}
E(i↑{\bar{2}}\mid c)&={2c↑{\bar{2}}\over (h-t+1)↑{\bar{2}}}\,;\cr
\noalign{\smallskip}
E(i↑2\mid c)&={c(2c-h+t)\over (h-t+1)↑{\bar{2}}}\,.\cr}$$

At the next insertion, $c$~is decreased, with probability $t/h$,
by~$i$, where the statistics of~$i$ are given above. Let $c'$ be the
new value of~$c$, after the $t+1↑{\rm st}$ insertion.
$$\eqalign{E(c'\mid c)&=c-{t\over h}E(i\mid c)=c{(h+1)(h-t)\over h(h-t+1)}\,;\cr
\noalign{\smallskip}
E(c')&=E\bigl(E(c'\mid c)\bigr)=E(c){(h+1)(h-t)\over h(h-t+1)}\,.\cr}$$
Letting $x↓t$ be $E(c)$ at time $t$,
$$\eqalignno{x↓0&=h+1\cr
x↓{j+1}&=x↓j{(h+1)(h-j)\over h(h-j+1)}\cr
\noalign{\hbox{therefore}}
E(c)&=\left({h+1\over h}\right)↑t(h-t+1)\,.\cr}$$
This solves Knuth's problem 6.4--41, rated [40], while just warming up.

Next we look at $E(c↑2)$, hoping to get $V(c)$; 
$$\eqalign{E(c'↑2\mid c)&=c↑2-{t\over h}\bigl(2c\,E(i)-E(i↑2)\bigr)\cr
\noalign{\smallskip}
&=c↑2-{t\over h}\left(2c\cdot {c\over h-t+1}-{c(2c-h+t)\over(h-t+1)↑{\bar{2}}}
\right)\cr
\noalign{\smallskip}
&=c↑2\left(1-{2t\over h(h-t+1)}+{2t\over h(h-t+1)↑{\bar{2}}}\right)-c
{t(h-t)\over h(h-t+1)↑{\bar{2}}}\cr
\noalign{\smallskip}
&=c↑2{(h+2)(h-t)\over h(h-t+2)}-c{t(h-t)\over h(h-t+1)↑{\bar{2}}}\cr}$$
to which we apply the eigenvalue lemma. The eigenvector, conveniently,
is~$c↑{\bar{2}}$.

Notice that I could have pulled a rabbit out of my hat by starting off with
$E(c↑{\bar{2}})$; I~have shown the false start, to show how a systematic
approach uncovers the solution.

So,
$$\eqalign{E(c↑{\bar{2}'}\mid c)&=c↑{\bar{2}}{(h+2)(h-t)\over h(h-t+2)}\cr
\noalign{\smallskip}
E(c↑{\bar{2}'})&=E(c↑{\bar{2}}){(h+2)(h-t)\over h(h-t+2)}\,.\cr}$$
Letting $x↓t=E(c↑{\bar{2}})$ after $t$~insertions,
$$\eqalign{x↓0&=(h+1)↑{\bar{2}}\cr
\noalign{\smallskip}
x↓{j+1}&=x↓j{(h+2)(h-j)\over h(h-j+2)}\,.\cr}$$
Forming the product of both sides for $0≤j≤t-1$, and cancelling~$x$'s,
$$\eqalign{x↓t&=x↓0\left({h+2\over h}\right)↑t
{h!/(h-t)!\over (h+2)!/(h+2-t)!}\cr
\noalign{\smallskip}
E(c↑{\bar{2}})&=\left({h+2\over h}\right)↑t(h-t+1)↑{\bar{2}}\cr
\noalign{\smallskip}
V(c)&=E(c↑{\bar{2}})-E(c)-E(c)↑2\cr}$$
\vskip 10pt
%\noalign{\smallskip}
$$=\left({h+2\over h}\right)↑t(h-t+1)↑{\bar{2}}-\left({h+1\over h}\right)↑t
(h-t+1)-\left({h+1\over h}\right)↑{2t}(h-t+1)↑2\,.$$

Now some approximations:
Let $\alpha =t/h$, the load factor.
$$E(c)=\left({h+1\over h}\right)↑t(h-t+1)\approx t\,e↑{\alpha}\left(\,
{1\over \alpha}-1\right)\,.$$
Expected number of probes is
$$\eqalign{E(h+1-c)
&=h+1-\left({h+1\over h}\right)↑t(h-t+1)\approx t\left(\,{1\over\alpha}
-e↑{\alpha}\left(\,{1\over\alpha}-1\right)\right)\cr
\noalign{\smallskip}
&=t\left((e↑{\alpha}-1)\left(1-{1\over\alpha}\,\right)+1\right)=t\,f(\alpha)\cr}$$
{}
$$\vcenter{\halign{$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\quad%
&$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\cr
\alpha&\epsilon&0.1&0.2&0.4&0.5&0.6&0.8&1.0\cr
\noalign{\smallskip}
f(\alpha)&\epsilon/2&0.053&0.114&0.262&0.351&0.452&0.694&1.000\cr}}$$

Approximating $V(c)$ is rather delicate, so we need (what else?) another
lemma or two.

\proclaim Lemma. Approximating $(1+y)↑p$.
If $y>0$, $(1+y)↑p$ lies between successive terms in the sequence
$1,\exp(py),\exp\bigl(p(y-y↑2/2)\bigr),\exp\bigl(p(y-y↑2/2+y↑3/3)\bigr),\ldots,
\prod↓{1≤i≤n}\exp\bigl(-p(-y)↑i/i),\ldots\,$.

\noindent{\bf Proof.}
If $x>0$, 
$$\eqalign{%
&1-(1+x)\sum↓{1≤i≤n}(-x)↑{i-1}=(-x)↑n\,;\cr
\noalign{\smallskip}
&{1\over 1+x}-\sum↓{1≤i≤n}(-x)↑{i-1}={(-x)↑n\over 1+x}\,;\cr
\noalign{\smallskip}
&\int↓0↑y{dx\over 1+x}-\int↓0↑y\sum↓{1≤i≤n}(-x)↑{-i}\,dx\qquad
\hbox{has ${\rm sign}(-1)↑n$, for $x>0$.}\cr
\noalign{\smallskip}
&{\rm sign}\left(p\ln(1+y)-p\sum↓{1≤i≤n}-{(-y)↑i\over i}\right)=(-)↑n\,.\cr}$$
Because ${\rm sign}(u-v)={\rm sign}(e↑u-e↑v)$,
$${\rm sign}\left((1+y)↑p-\exp\left(p\left(y-{y↑2\over 2}+{y↑3\over 3}
-\cdots -{(-x)↑n\over n}\right)\right)\right)=(-)↑n\,.$$

\proclaim Lemma: Approximating $e↑{-x}$ by its Taylor Series.
For $x>0$, $e↑{-x}$ is between successive terms of its Taylor series
$1-x+x↑2/2-x↑3/6+\cdots\,$.

\noindent{\bf Proof.}
Let $f↓i(x)$ be an arbitrary function that is negative if $i$ is odd,
positive if $i$ is even. By inductive hypothesis
$e↑{-x}=\sum↓{0≤i≤n}{(-x)↑i\over i!}-f↓n(x)$, so
$$e↑{-y}=1-\int↓0↑ye↑{-x}\,dx=1+\sum↓{0≤i≤n}{(-y)↑{i+1}\over
(i+1)!}+\int↓0↑yf↓n(x)\,dx
=\sum↓{0≤i≤n+1}{(-y)↑i\over i!}-f↓{n+1}(x)\,.$$
%$$\eqalign{e↑{-y}&=1-\int↓0↑ye↑{-x}\,dx&=1+\sum↓{0≤i≤n}{(-y)↑{i+1}\over
%(i+1)!}+\int↓0↑yf↓n(x)\,dx\cr
%\noalign{\smallskip}
%&=\sum↓{0≤i≤n+1}{(-y)↑i\over i!}+f↓{n+1}(x)\,.\cr}$$

Repeating,
$$V(c)=\left({h+2\over h}\right)↑t(h-t+1)↑{\bar{2}}-\left({h+1\over h}\right)↑t
(h-t+1)-\left({h+1\over h}\right)↑{2t}(h-t+1)↑2\,.$$
Because $V(c)$ is $O(t)$, we must not introduce any errors of approximation
unless their contribution is $o(t)$. Note that $\alpha$ and $e↑{\alpha}$ are
$O(1)$.
$$\eqalign{V(c)&\approx e↑{2\alpha}\left(1-{2t\over h↑2}\right)
(h-t+1)↑{\bar{2}}-e↑{\alpha}\left(1-{t\over 2h↑2}\right)(h-t)
-e↑{2\alpha}\left(1-{t\over h↑2}\right)(h-t+1)↑2\cr
\noalign{\smallskip}
&\approx e↑{2\alpha}(h-t+1)\left(\left(1-{2t\over h↑2}\right)(h-t+2)-
\left(1-{t\over h↑2}\right)(h-t+1)\right)-e↑{\alpha}(h-t)\cr
\noalign{\smallskip}
&=e↑{2\alpha}(h-t+1)\left(1+{t\over h↑2}(-2h+2t-4+h-t+1)\right)
-e↑{\alpha}(h-t)\cr
\noalign{\smallskip}
&\approx e↑{2\alpha}(h-t)\left(1+{t\over h↑2}(-h+t)\right)-e↑{\alpha}(h-t)\cr
\noalign{\smallskip}
&=e↑{2\alpha}\left((h-t)-{t(h-t)↑2\over h↑2}\right)-e↑{\alpha}(h-t)\cr
\noalign{\smallskip}
&=t\left({e↑{2\alpha}-e↑{\alpha}\over\alpha}(1-\alpha)-e↑{2\alpha}(1-\alpha)↑2
\right)=t\,g(\alpha)\,.\cr}$$
{}
$$\vcenter{\halign{$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\quad%
&$#$\hfil\quad&$#$\hfil\quad&$#$\hfil\cr
\alpha&0.2&0.4&0.6&0.7&0.8&0.9\cr
\noalign{\smallskip}
g(\alpha)&0.123&0.299&0.467&0.510&0.484&0.338\cr}}$$

We conclude that the movement of the cellar pointer is not costly in
practice (symbol tables are seldom even half full, so probably much
less than $t/3$ is a safe conjecture for the typical number of moves).
The standard deviation is never more than about $\sqrt{t/2}$, so
performance is very consistent.




\bigskip
\parindent0pt
\copyright 1986 Robert W. Floyd.
First draft (not published)
September 29, 1986.
%revised date;
%subsequently revised.

\bye
\proclaim Lemma. For $x>0${}
$$\left.\eqalign{1-x&\cr
1-x+x↑2-x↑3&\cr
\hbox{\rm etc.}&\cr}
\right\}<{1\over 1+x}<
\left\{\eqalign{&1\cr
&1-x+x↑2\cr
&1-x+x↑2-x↑3+x↑4\cr
&\hbox{\rm etc.}\cr}
\right.$$
as can be seen multiplying by $1+x$. Integrating, we get for $y>0$
$$\left.\eqalign{0&\cr
y-y↑2/2&\cr
y-y↑2/2+y↑3/3-y↑4/4&\cr
\hbox{\rm etc.}&\cr}
\right\}<
\int↓0↑y{1\over 1+x}\,dx=\ln(1+y)<
\left\{\eqalign{&y\cr
&y-y↑2/2+y↑3/3\cr
&\hbox{\rm etc.}\cr}
\right.$$
Together with $\ln\bigl((1+y)↑p\bigr)=p\ln(1+y)$, for positive~$p$
and~$y$, exponentiating gives
$$\left.\eqalign{1&\cr
\exp(px-px↑2/2)&\cr
\hbox{\rm etc.}&\cr}
\right\}<
(1+x)↑p<
\left\{\eqalign{&\exp(px)\cr
&\exp(px-px↑2/2+px↑3/3\cr
&\hbox{\rm etc.}\cr}
\right.$$
so we can say
$$(1+x)↑p=\exp(px)\cdot\exp(-px↑2/2)\cdot\exp(px↑3/3)\cdots\;.$$
Another useful inequality: because $e↑{-x}=1-\int↓{-x}↑0e↑{-y}\,dy$,
and $e↑{-y}>0$, for $x>0$ successive integrations give
$$e↑{-x}\left\{\eqalign{&<1\cr
&>1-x\cr
&<1-x+x↑2/2\cr
&>1-x+x↑2/2-x↑3/6\cr
&\quad\hbox{\rm etc.}\cr}
\right.$$

\bye